already given by a Chen-Stein method. Our approach, the $\psi$-mixing method, gives
local bounds. Since we only need the error in the tails of distribution, the global
bounds. We search for two thresholds on the number of occurrences from which
we can regard the studied word as an over-represented or an under-represented
one. A biological role is suggested for these over- or under-represented words. Our
method gives such thresholds for a panel of words much broader than the Chen-Stein
method. Comparing the methods, we observe a better accuracy for the $\psi$-mixing
PANOW dedicated to the computation of the error term and the thresholds for a

1 Introduction



Modelling DNA sequences with stochastic models and developing statistical
methods to analyse the enormous set of data that results from the multiple
projects of DNA sequencing are challenging questions for statisticians and
biologists. Many DNA sequence analyses are based on the distribution of the
occurrences of patterns having some special biological function. The most
popular model in this domain is the Markov chain model, which gives a description
of the local behaviour of the sequence (see Almagor [6], Blaisdell [10], Phillips
et al. [23], Gelfand et al. [16]). An important problem is to determine the
statistical significance of a word frequency in a DNA sequence. Nicodeme
et al. [21] discuss the relevance of finding over- or under-represented
words. The naive idea is the following: a word may have a significantly low
frequency in a DNA sequence because it disrupts replication or gene expression,
whereas a significantly frequent word may have a fundamental activity with
regard to genome stability. Well-known examples of words with exceptional
frequencies in DNA sequences are biological palindromes corresponding to
restriction sites avoided for instance in E. coli (Karlin et al. [18]), the Cross-over
Hotspot Instigator sites in several bacteria, again in E. coli for example (Smith
et al. [29], El Karoui et al. [14]), and uptake sequences (Smith et al. [30]) or
polyadenylation signals (van Helden et al. [33]).



The exact distribution of the number of occurrences of a word under the
Markovian model is known and some software is available (Robin and Daudin
[28], Regnier [25]) but, because of numerical complexity, it is often used
to compute the expectation and variance of a given count (and thus uses, in fact,
Gaussian approximations for the distribution). In fact these methods are not
efficient for long sequences or if the Markov model order is larger than 2 or
3. For such cases, several approximations are possible: Gaussian approximations
(Prum et al. [24]), Binomial (or Poisson) approximations (van Helden
et al. [32], Godbole [17]), compound Poisson approximations (Reinert and
Schbath [26]), or large deviations approach (Nuel [22]). In this paper we
only focus on the Poisson approximation. We approximate $\mathbb{P}(N(A) = k)$ by
$\exp(-t\mathbb{P}(A))[t\mathbb{P}(A)]^k (k!)^{-1}$, where $\mathbb{P}(N(A) = k)$ is the stationary probability
under the Markov model that the number of occurrences $N(A)$ of word $A$ is
equal to $k$, $\mathbb{P}(A)$ is the probability that word $A$ occurs at a given position, and $t$
is the length of the sequence. Intuitively, a binomial distribution could be used
to approximate the distribution of occurrences of a particular word. The length
$t$ of the sequence is large, $\mathbb{P}(A)$ is small and $t\mathbb{P}(A)$ is almost constant. Thus,
we use the more numerically convenient Poisson approximation. Our aim is
to bound the error between the distribution of the number of occurrences of
word $A$ and its Poisson approximation. In Reinert and Schbath [26], the
authors prove an upper bound for a compound Poisson approximation. They
use a Chen-Stein method, which is the usual method for this purpose. This
method was developed by Chen for Poisson approximations (Chen [12])
after a work of Stein on normal approximations (Stein [31]). Its principle is to
bound the difference between the two distributions in total variation distance
for all subsets of the definition domain. Since we are interested in under- or
over-represented words, we are only interested in this difference for the tails of
the distributions. The uniform bound given by the Chen-Stein method
is then too large for our purpose. We present here a new method, based on the
property of mixing processes. Our method has the useful particularity of giving
a bound on the error at each point of the distribution. More precisely, it offers
an error term $\epsilon$, for the number of occurrences $k$ of word $A$:

$$\left| \mathbb{P}(N(A) = k) - \frac{e^{-t\mathbb{P}(A)} (t\mathbb{P}(A))^k}{k!} \right| \le \epsilon(A, k).$$

Moreover, $\epsilon(A, k)$ decays factorially fast with respect to $k$.
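The approximation described above can be sketched numerically. The snippet below is only an illustration under stated assumptions: it takes a first-order Markov chain (the toy uniform model, the example word, and the function names `word_probability` and `poisson_pmf` are hypothetical and not taken from the paper or from PANOW), computes $\mathbb{P}(A)$ for a word, and evaluates the Poisson term $e^{-t\mathbb{P}(A)}(t\mathbb{P}(A))^k/k!$ for a few values of $k$.

```python
import math

def word_probability(word, stationary, transition):
    """Stationary probability that `word` starts at a given position
    under a first-order Markov chain (assumed model order: 1)."""
    prob = stationary[word[0]]
    for prev, cur in zip(word, word[1:]):
        prob *= transition[prev][cur]
    return prob

def poisson_pmf(k, mean):
    """exp(-mean) * mean**k / k!, evaluated in log space for stability."""
    if mean == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

# Hypothetical toy model: independent uniform letters written as a Markov chain.
letters = "acgt"
stationary = {x: 0.25 for x in letters}
transition = {x: {y: 0.25 for y in letters} for x in letters}

t = 100_000                      # sequence length
p_word = word_probability("gctggtgg", stationary, transition)
approx = [poisson_pmf(k, t * p_word) for k in range(5)]
```

Working in log space avoids overflow of $k!$ when the count $k$ is large, which matters when scanning the tails of the distribution.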

Abadi [1], [y] presents lower and upper bounds for the exponential approximation
of the first occurrence time of a rare event, also called hitting time, in a
stationary stochastic process on a finite alphabet with $\alpha$- or $\phi$-mixing
property. Abadi and Vergne [4] describe the statistics of return times of a string
of symbols in such a process. In Abadi and Vergne [5], the authors prove a
Poisson approximation for the distribution of occurrence times of a string of
symbols in a $\phi$-mixing process. The first part of our work is to determine
some constants not explicitly computed in the results of the above mentioned
articles but necessary for the proof of our theorem. Our work is complementary
to all these articles, in the sense that it relies on them for preliminary
results and it adapts them to $\psi$-mixing processes. Since Markov chains are
mixing processes, all these results established for mixing processes also apply
to Markov chains which model biological sequences.

This paper is organised in the following way. In Section 2 we introduce the
Chen-Stein method. In Section 3 we define a $\psi$-mixing process and state some
preliminary notations, mostly on the properties of a word. We also present
in this section the principal result of our work: the Poisson approximation
(Theorem 5). In Section 4 we state preliminary results. Mainly, we recall
results of Abadi [2], computing all the necessary constants, and we present
lemmas and propositions necessary for the proof of Theorem 5. In Section 5 we
establish the proof of our main result: Theorem 5 on Poisson approximation.
Using $\psi$-mixing properties and preliminary results, we prove an upper bound
for the difference between the exact distribution of the number of occurrences of
word $A$ and the Poisson distribution of parameter $t\mathbb{P}(A)$. Section 6 is dedicated
to numerical results. For the search of over-represented words, we compare our
method to the Chen-Stein method on both synthetic and biological data. In this
section, we also present results obtained by a similar method, the $\phi$-mixing
method. We end the paper by presenting some examples of biological applications,
and some conclusions and perspectives of future works.



2 The Chen-Stein method

2.1 Total variation distance

Definition 1 For any two random variables $X$ and $Y$ with values in the same
discrete space $E$, the total variation distance between their probability
distributions is defined by

$$d_{TV}(\mathcal{L}(X), \mathcal{L}(Y)) = \frac{1}{2} \sum_{i \in E} |\mathbb{P}(X = i) - \mathbb{P}(Y = i)|.$$

We remark that for any subset $S$ of $E$

$$|\mathbb{P}(X \in S) - \mathbb{P}(Y \in S)| \le d_{TV}(\mathcal{L}(X), \mathcal{L}(Y)).$$
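As a small numerical illustration of the definition and of the remark on subsets, the following sketch compares a binomial distribution with a Poisson distribution of the same mean. The parameters (50 trials, success probability 0.02) and the helper names are arbitrary choices made for this example, and the Poisson law is truncated at 50, which discards only a negligible mass.

```python
import math

def total_variation(p, q):
    """Half the L1 distance between two pmfs given as dicts value -> probability."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(i, 0.0) - q.get(i, 0.0)) for i in support)

def binomial_pmf(n, prob):
    return {k: math.comb(n, k) * prob**k * (1 - prob)**(n - k) for k in range(n + 1)}

def poisson_pmf(mean, kmax):
    # Truncated at kmax; the neglected tail mass is tiny for the values used here.
    return {k: math.exp(-mean) * mean**k / math.factorial(k) for k in range(kmax + 1)}

binom = binomial_pmf(50, 0.02)
poiss = poisson_pmf(1.0, 50)
d_tv = total_variation(binom, poiss)

# The gap on one particular subset (here an upper tail) never exceeds d_tv.
tail = range(4, 51)
gap = abs(sum(binom[k] for k in tail) - sum(poiss[k] for k in tail))
```

The tail gap is typically much smaller than the total variation distance, which is the motivation for looking for pointwise rather than uniform bounds.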

2.2 The Chen-Stein method



The Chen-Stein method is used to bound the error between the distribution
of the number of occurrences of a word $A$ in a sequence $X$ and the Poisson
distribution with parameter $t\mathbb{P}(A)$, where $t$ is the length of the sequence and
$\mathbb{P}(A)$ the stationary measure of $A$. The Chen-Stein method for Poisson
approximation has been developed by Chen [12]; a friendly exposition is in Arratia
et al. [7] and a description with many examples can be found in Arratia et al.
[8] and Barbour et al. [9]. We will use Theorem 1 in Arratia et al. [8] with an
improved bound by Barbour et al. [9] (Theorem 1.A and Theorem 10.A).

First, we will fix a few notations. Let $\mathcal{A}$ be a finite set (for example, in the
DNA case $\mathcal{A} = \{a, c, g, t\}$). Put $\Omega = \mathcal{A}^{\mathbb{Z}}$. For each $x = (x_m)_{m \in \mathbb{Z}} \in \Omega$, we denote
by $X_m$ the $m$-th coordinate of the sequence $x$: $X_m(x) = x_m$. We denote by
$T : \Omega \to \Omega$ the one-step-left shift operator: so we will have $(T(x))_m = x_{m+1}$.
We denote by $\mathcal{F}$ the $\sigma$-algebra over $\Omega$ generated by strings and by $\mathcal{F}_I$ the
$\sigma$-algebra generated by strings with coordinates in $I$ with $I \subseteq \mathbb{Z}$. We consider an
invariant probability measure $\mathbb{P}$ over $\mathcal{F}$. Consider a stationary Markov chain
$X = (X_i)_{i \in \mathbb{Z}}$ on the finite alphabet $\mathcal{A}$. Let us fix a word $A = (a_1, \dots, a_n)$. For
$i \in \{1, 2, \dots, t - n + 1\}$, let $Y_i$ be the following random variable

$$Y_i = Y_i(A) = 1\{\text{word } A \text{ appears at position } i \text{ in the sequence}\} = 1\{(X_i, \dots, X_{i+n-1}) = (a_1, \dots, a_n)\},$$

where $1\{F\}$ denotes the indicator function of set $F$. We put $Y = \sum_{i=1}^{t-n+1} Y_i$,
the random variable corresponding to the number of occurrences of a word,
$\mathbb{E}(Y_i) = m_i$ and $\sum_{i=1}^{t-n+1} m_i = m$. Then, $\mathbb{E}(Y) = m$. Let $Z$ be a Poisson random
variable with parameter $m$: $Z \sim \mathcal{P}(m)$. For each $i$, we arbitrarily define a set
$V(i) \subset \{1, 2, \dots, t - n + 1\}$ containing the point $i$. The set $V(i)$ will play the
role of a neighbourhood of $i$.

Theorem 2 (Arratia et al. [8], Barbour et al. [9]) Let $I$ be an index set.
For each $i \in I$, let $Y_i$ be a Bernoulli random variable with $p_i = \mathbb{P}(Y_i = 1) > 0$.
Suppose that, for each $i \in I$, we have chosen $V(i) \subset I$ with $i \in V(i)$. Let
$Z_i$, $i \in I$, be independent Poisson variables with mean $p_i$. The total variation
distance between the dependent Bernoulli process $\underline{Y} = \{Y_i, i \in I\}$ and the
Poisson process $\underline{Z} = \{Z_i, i \in I\}$ satisfies

$$d_{TV}(\mathcal{L}(\underline{Y}), \mathcal{L}(\underline{Z})) \le b_1 + b_2 + b_3$$

where

$$b_1 = \sum_{i \in I} \sum_{j \in V(i)} \mathbb{E}(Y_i)\mathbb{E}(Y_j),$$

$$b_2 = \sum_{i \in I} \sum_{j \in V(i), j \ne i} \mathbb{E}(Y_i Y_j),$$

$$b_3 = \sum_{i \in I} \mathbb{E}\left| \mathbb{E}(Y_i - p_i \mid Y_j, j \notin V(i)) \right|.$$

Moreover, if $W = \sum_{i \in I} Y_i$ and $\lambda = \sum_{i \in I} p_i < \infty$, then

$$d_{TV}(\mathcal{L}(W), \mathcal{P}(\lambda)) \le \frac{1 - e^{-\lambda}}{\lambda}(b_1 + b_2) + \min\left(1, \sqrt{\frac{2}{\lambda e}}\right) b_3.$$

We think of $V(i)$ as a neighbourhood of strong dependence of $Y_i$. Intuitively,
$b_1$ describes the contribution related to the size of the neighbourhood and the
weights of the random variables in that neighbourhood; if all $Y_i$ had the same
probability of success, then $b_1$ would be directly proportional to the
neighbourhood size. The term $b_2$ accounts for the strength of the dependence inside
the neighbourhood; as it depends on the second moments, it can be viewed as
a "second order interaction" term. Finally, $b_3$ is related to the strength of
dependence of $Y_i$ with random variables outside its neighbourhood. In particular,
note that $b_3 = 0$ if $Y_i$ is independent of $\{Y_j \mid j \notin V(i)\}$.

One consequence of this theorem is that for any indicator function of an event,
i.e. for any measurable functional $h$ from $\Omega$ to $[0, 1]$, there is an error bound of
the form $|\mathbb{E}h(\underline{Y}) - \mathbb{E}h(\underline{Z})| \le d_{TV}(\mathcal{L}(\underline{Y}), \mathcal{L}(\underline{Z}))$. Thus, if $S(\underline{Y})$ is a test statistic
then, for all $t \in \mathbb{R}$,

$$|\mathbb{P}(S(\underline{Y}) \ge t) - \mathbb{P}(S(\underline{Z}) \ge t)| \le b_1 + b_2 + b_3,$$

which can be used to construct confidence intervals and to find p-values for
tests based on this statistic.
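To make the role of the neighbourhoods concrete, here is a minimal sketch of the first Chen-Stein term, $b_1 = \sum_i \sum_{j \in V(i)} p_i p_j$, for indicators with a common success probability. The window-shaped neighbourhood $V(i) = \{j : |i - j| \le n - 1\}$, the 0-based positions and the numerical values are assumptions chosen for the example; the terms $b_2$ and $b_3$ depend on the joint law of the sequence and are not computed here.

```python
def chen_stein_b1(p, neighbourhood):
    """b1 = sum over i of sum over j in V(i) of p_i * p_j."""
    return sum(p[i] * p[j] for i in range(len(p)) for j in neighbourhood(i))

def window(radius, size):
    """Hypothetical neighbourhood V(i) = {j : |i - j| <= radius}, clipped to [0, size)."""
    def v(i):
        return range(max(0, i - radius), min(size, i + radius + 1))
    return v

size, prob, n = 1000, 1e-3, 8      # positions, P(Y_i = 1), word length
p = [prob] * size
b1 = chen_stein_b1(p, window(n - 1, size))
```

With equal probabilities, $b_1$ equals $p^2$ times the total neighbourhood size, which shows directly why $b_1$ grows linearly with the window radius.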



3 Preliminary notations and Poisson Approximation

3.1 Preliminary notations

We focus on Markov processes in our biological applications (see Section 6) but the
theorem given in the following subsection is established for more general
mixing processes: the so called $\psi$-mixing processes.

Definition 3 Let $\psi = (\psi(\ell))_{\ell \ge 0}$ be a sequence of real numbers decreasing to
zero. We say that $(X_m)_{m \in \mathbb{Z}}$ is a $\psi$-mixing process if for all integers $\ell \ge 0$, the
following holds

$$\sup \frac{|\mathbb{P}(B \cap T^{-(n+\ell+1)}(C)) - \mathbb{P}(B)\mathbb{P}(C)|}{\mathbb{P}(B)\mathbb{P}(C)} = \psi(\ell),$$

where the supremum is taken over the sets $B$ and $C$, such that $\mathbb{P}(B)\mathbb{P}(C) > 0$.

For a word $A$ of $\Omega$, that is to say a measurable subset of $\Omega$, we say that $A \in \mathcal{C}_n$
if and only if

$$A = \{X_0 = a_0, \dots, X_{n-1} = a_{n-1}\},$$

with $a_i \in \mathcal{A}$, $i = 0, \dots, n-1$. Then, the integer $n$ is the length of word $A$. For
$A \in \mathcal{C}_n$, we define the hitting time $\tau_A : \Omega \to \mathbb{N} \cup \{\infty\}$, as the random variable
defined on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$:

$$\forall x \in \Omega, \quad \tau_A(x) = \inf\{k \ge 1 : T^k(x) \in A\}.$$

$\tau_A$ is the first time that the process hits a given measurable $A$. We also use
the classical probabilistic shorthand notations. We write $\{\tau_A = m\}$ instead
of $\{x \in \Omega : \tau_A(x) = m\}$, $T^{-k}(A)$ instead of $\{x \in \Omega : T^k(x) \in A\}$ and
$\{X_r^s = x_r^s\}$ instead of $\{X_r = x_r, \dots, X_s = x_s\}$. Also we write for two measurable
subsets $A$ and $B$ of $\Omega$, the conditional probability of $B$ given $A$ as $\mathbb{P}(B|A) =
\mathbb{P}_A(B) = \mathbb{P}(B \cap A)/\mathbb{P}(A)$ and the probability of the intersection of $A$ and $B$
by $\mathbb{P}(A \cap B)$ or $\mathbb{P}(A; B)$. For $A = \{X_0^{n-1} = x_0^{n-1}\}$ and $1 \le w \le n$, we write
$A^{(w)} = \{X_{n-w}^{n-1} = x_{n-w}^{n-1}\}$ for the event consisting of the last $w$ symbols of $A$.
We also write $a \vee b$ for the supremum of two real numbers $a$ and $b$. We define
the periodicity $p_A$ of $A \in \mathcal{C}_n$ as follows:

$$p_A = \inf\{k \in \mathbb{N}^* \mid A \cap T^{-k}(A) \ne \emptyset\}.$$

$p_A$ is called the principal period of word $A$. Then, we denote by $\mathcal{R}_p = \mathcal{R}_p(n)$
the set of words $A \in \mathcal{C}_n$ with periodicity $p$ and we also define $\mathcal{B}_n$ as the set
of words $A \in \mathcal{C}_n$ with periodicity less than $[n/2]$, where $[\cdot]$ defines the integer
part of a real number:

$$\mathcal{R}_p = \{A \in \mathcal{C}_n \mid p_A = p\}, \qquad \mathcal{B}_n = \bigcup_{p=1}^{[n/2]} \mathcal{R}_p.$$



$\mathcal{B}_n$ is the set of words which are self-overlapping before half their length (see
Example 4). We define $\mathcal{R}(A)$ the set of return times of $A$ which are not a
multiple of its periodicity $p_A$:

$$\mathcal{R}(A) = \{k \in \{[n/p_A]p_A + 1, \dots, n-1\} \mid A \cap T^{-k}(A) \ne \emptyset\}.$$

Let us denote $r_A = \#\mathcal{R}(A)$, the cardinality of the set $\mathcal{R}(A)$. Define also
$n_A = \min \mathcal{R}(A)$ if $\mathcal{R}(A) \ne \emptyset$ and $n_A = n$ otherwise. $\mathcal{R}(A)$ is called the
set of secondary periods of $A$ and $n_A$ is the smallest secondary period of $A$.
Finally, we introduce the following notation. For an integer $s \in \{0, \dots, t-1\}$,
let $N_s^t = \sum_{i=s}^{t} 1\{T^{-i}(A)\}$. The random variable $N_s^t$ counts the number of
occurrences of $A$ between $s$ and $t$ (we omit the dependence on $A$). For the
sake of simplicity, we also put $N^t = N_0^t$.

Example 4 Consider the word $A = aaataaataaa$. Since $p_A = 4$, we have
$A \in \mathcal{B}_n$ where $n = 11$. See the following figure to note that $\mathcal{R}(A) = \{9; 10\}$,
$r_A = 2$ and $n_A = 9$.

    0  1  2  3  4  5  6  7  8  9  10
    a  a  a  t  a  a  a  t  a  a  a
                a  a  a  t  a  a  a  t  a  a  a
                            a  a  a  t  a  a  a  t  a  a  a
                               a  a  a  t  a  a  a  t  a  a  a
                                  a  a  a  t  a  a  a  t  a  a  a
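The quantities of Example 4 can be checked mechanically. The sketch below is an illustrative helper (the function names are hypothetical, not part of PANOW) that uses the fact that $A \cap T^{-k}(A) \ne \emptyset$ for $1 \le k < n$ exactly when the last $n - k$ letters of the word equal its first $n - k$ letters.

```python
def overlap_shifts(word):
    """Shifts 1 <= k < n at which the word can overlap a copy of itself."""
    n = len(word)
    return [k for k in range(1, n) if word[k:] == word[:n - k]]

def periodicity(word):
    """Principal period p_A; equals n when the word never overlaps itself."""
    shifts = overlap_shifts(word)
    return shifts[0] if shifts else len(word)

def secondary_periods(word):
    """R(A): overlap shifts in {[n/p_A] p_A + 1, ..., n - 1}."""
    n, p = len(word), periodicity(word)
    start = (n // p) * p + 1
    return [k for k in overlap_shifts(word) if k >= start]

def smallest_secondary_period(word):
    """n_A = min R(A) if R(A) is non-empty, and n otherwise."""
    periods = secondary_periods(word)
    return periods[0] if periods else len(word)

word = "aaataaataaa"
p_a = periodicity(word)                  # principal period
r_set = secondary_periods(word)          # R(A)
n_a = smallest_secondary_period(word)    # n_A
```

For the word of Example 4 the overlap shifts are 4, 8, 9 and 10; the first two are multiples of the period and only the last two are secondary periods.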



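3.2 The mixing method

We present a theorem that gives an error bound for the Poisson approximation.
Compared to the Chen-Stein method, it has the advantage of giving non-uniform
bounds that strongly control the decay of the tail distribution of $N^t$.

Theorem 5 ($\psi$-mixing approximation) Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing
process. There exists a constant $C_\psi = 254$, such that for all $A \in \mathcal{C}_n \setminus \mathcal{B}_n$ and all
non negative integers $k$ and $t$, the following inequality holds:

$$\left| \mathbb{P}(N^t = k) - \frac{e^{-t\mathbb{P}(A)} (t\mathbb{P}(A))^k}{k!} \right| \le C_\psi \, e_\psi(A) \, e^{-(t-(3k+1)n)\mathbb{P}(A)} \, g_\psi(A, k),$$

where

$$g_\psi(A, k) = \begin{cases} \dfrac{(2\lambda)^{k-1}}{(k-1)!} & \text{if } k \notin \left\{ \dfrac{\lambda}{e_\psi(A)}, \dots, \dfrac{2t}{n} \right\}, \\[2ex] \dfrac{(2\lambda)^{k-1}}{\left(\dfrac{\lambda}{e_\psi(A)}\right)! \left(\dfrac{1}{e_\psi(A)}\right)^{k - \frac{\lambda}{e_\psi(A)} - 1}} & \text{if } k \in \left\{ \dfrac{\lambda}{e_\psi(A)}, \dots, \dfrac{2t}{n} \right\}, \end{cases}$$

$$e_\psi(A) = \inf_{1 \le w \le n_A} \left[ (r_A + n)\,\mathbb{P}(A^{(w)})\,(1 + \psi(n_A - w)) \right]$$

and $\lambda = t\mathbb{P}(A)(1 + \psi(n))$.

This result is at the core of our study. It shows an upper bound for the
difference between the distribution of the number of occurrences of word $A$ in
a sequence of length $t$ and the Poisson distribution of parameter $t\mathbb{P}(A)$. The proof
is postponed to Section 5.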
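The quantity $e_\psi(A) = \inf_{1 \le w \le n_A} [(r_A + n)\mathbb{P}(A^{(w)})(1 + \psi(n_A - w))]$ is a finite minimum and is simple to evaluate once the suffix probabilities are known. The sketch below is only an illustration: the function name `e_psi`, the uniform letter model and the mixing function $\psi(\ell) = K\nu^\ell$ with $K = 1$, $\nu = 0.5$ are assumptions made for the example, not values from the paper.

```python
def e_psi(suffix_probs, r_a, n_a, n, psi):
    """min over 1 <= w <= n_a of (r_a + n) * P(A^(w)) * (1 + psi(n_a - w)).

    suffix_probs[w] is the probability of the last w letters of the word.
    """
    return min((r_a + n) * suffix_probs[w] * (1.0 + psi(n_a - w))
               for w in range(1, n_a + 1))

# Hypothetical setting: a non-self-overlapping word of length 8 (so r_A = 0 and
# n_A = n = 8) with independent uniform letters, and psi(l) = K * nu**l.
n = 8
suffix_probs = {w: 0.25 ** w for w in range(1, n + 1)}
psi = lambda l: 1.0 * 0.5 ** l

value = e_psi(suffix_probs, r_a=0, n_a=n, n=n, psi=psi)
```

In this toy setting the minimum is reached at $w = n_A$: shortening the suffix increases $\mathbb{P}(A^{(w)})$ by a factor 4 per letter, which outweighs the decrease of $\psi(n_A - w)$.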



4 Calculation of the constants

Our goal is to compute a bound as small as possible to control the error
between the Poisson distribution and the distribution of the number of
occurrences of a word. Thus, we determine the global constant $C_\psi$ appearing in
Theorem 5 by means of intermediary bounds appearing in the proof. General
bounds are interesting asymptotically in $n$, but for biological applications, $n$
is approximately between 10 and 20, which is too small. Then along the proof,
we will indicate the intermediary bounds that we compute. Before establishing
the proof of Theorem 5, we point out here, for easy reference, some
results of Abadi [2], and some other useful results. In Abadi [2], these results
are given only in the $\phi$-mixing context. Moreover exact values of the constants
are not given, while these are necessary for practical use of these methods. We
provide the values of all the constants appearing in the proofs of these results.

Proposition 6 (Proposition 11 in Abadi [2]) Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing
process. There exist two finite constants $C_a > 0$ and $C_b > 0$, such that for any
$n$, any word $A \in \mathcal{C}_n$, and any $c \in \left[4n, \frac{1}{2\mathbb{P}(A)}\right]$ satisfying

$$\psi(c/4) \le \mathbb{P}\left(\{\tau_A \le c/4\} \cap \{\tau_A \circ T^{c/4} > c/2\}\right),$$

there exists $\Delta$, with $n \le \Delta \le c/4$, such that for all positive integers $k$, the
following inequalities hold:

$$\left| \mathbb{P}(\tau_A > kc) - \mathbb{P}(\tau_A > c - 2\Delta)^k \right| \le C_a \, \epsilon(A) \, k \, \mathbb{P}(\tau_A > c - 2\Delta)^k, \qquad (1)$$

$$\left| \mathbb{P}(\tau_A > kc) - \mathbb{P}(\tau_A > c)^k \right| \le C_b \, \epsilon(A) \, k \, \mathbb{P}(\tau_A > c - 2\Delta)^k, \qquad (2)$$

with $\epsilon(A) = \inf_{n \le \ell \le \frac{1}{\mathbb{P}(A)}} [\ell\,\mathbb{P}(A) + \psi(\ell)]$.

Both inequalities provide an approximation of the hitting time distribution
by a geometric distribution at any point $t$ of the form $t = kc$. The difference
between these distributions is that in (1) the geometric term inside the modulus
is the same as in the upper bound, while in (2) the geometric term inside the
modulus is larger than the one in the upper bound. That is, the second bound
gives a larger error. We will use both in the proof of Theorem 8.

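Proposition 7 We have $C_a = 24$ and $C_b = 25$.

PROOF. For the details of the proof of Proposition 6, we refer to Proposition
11 in Abadi [2]. For any $c \in \left[4n, \frac{1}{2\mathbb{P}(A)}\right]$ and $\Delta \in [n, c/4]$, we denote
$\mathcal{N}_j^i = \{\tau_A \circ T^{ic + j\Delta} > c - j\Delta\}$ and $\mathcal{N} = \{\tau_A > c - 2\Delta\}$ for the sake of simplicity.
Abadi [2] obtains the following bound:

$$\forall k \ge 2, \quad \left| \mathbb{P}(\tau_A > kc) - \mathbb{P}(\mathcal{N})^k \right| \le (a) + (b) + (c), \text{ with}$$

$$(a) = \sum_{j=0}^{k-2} \mathbb{P}(\mathcal{N})^j \left| \mathbb{P}(\tau_A > (k-j)c) - \mathbb{P}\left(\tau_A > (k-j-1)c; \mathcal{N}_2^{k-j-1}\right) \right|,$$

$$(b) = \sum_{j=0}^{k-2} \mathbb{P}(\mathcal{N})^j \left| \mathbb{P}\left(\tau_A > (k-j-1)c; \mathcal{N}_2^{k-j-1}\right) - \mathbb{P}(\tau_A > (k-j-1)c)\,\mathbb{P}(\mathcal{N}) \right|,$$

$$(c) = \mathbb{P}(\mathcal{N})^{k-1} \left| \mathbb{P}(\tau_A > c) - \mathbb{P}(\mathcal{N}) \right|.$$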



First, for any measurable B G J^{(i+i)c,{i+2)c+n-i} , we have F (B) + ■?/' (A) < 
3^: (A) < §£ (A). We can also remark that P (A/") > 1/2. Then, by iteration of 
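the mixing property, we have the following inequality for all $\ell \in \mathbb{N}$:

$$\mathbb{P}\left( \bigcap_{i=0}^{\ell} \mathcal{N}_2^i; B \right) \le 6\, \mathbb{P}(\mathcal{N})^{\ell+1} \epsilon(A).$$

We apply this bound in the inequalities (14) and (15) of Abadi [2] to get

$$(a) \le \sum_{j=0}^{k-2} \mathbb{P}(\mathcal{N})^j \left( 6\,\mathbb{P}(\mathcal{N})^{k-j-2+1} \epsilon(A) \right) = 6(k-1)\,\epsilon(A)\,\mathbb{P}(\mathcal{N})^{k-1},$$

$$(b) \le \sum_{j=0}^{k-2} \mathbb{P}(\mathcal{N})^j \left( 6\,\mathbb{P}(\mathcal{N})^{k-j-2+1} \epsilon(A) \right) = 6(k-1)\,\epsilon(A)\,\mathbb{P}(\mathcal{N})^{k-1}.$$

We also have $(c) \le \mathbb{P}(\mathcal{N})^{k-1}\,\mathbb{P}(\mathcal{N}; \tau_A \circ T^{c-2\Delta} \le 2\Delta) \le \epsilon(A)\,\mathbb{P}(\mathcal{N})^{k-1}$.
Since $\mathbb{P}(\mathcal{N}) \ge 1/2$, we obtain (1):

$$\left| \mathbb{P}(\tau_A > kc) - \mathbb{P}(\mathcal{N})^k \right| \le 24\,k\,\epsilon(A)\,\mathbb{P}(\mathcal{N})^k.$$

We deduce (2):

$$\left| \mathbb{P}(\tau_A > kc) - \mathbb{P}(\tau_A > c)^k \right| \le 25\,k\,\epsilon(A)\,\mathbb{P}(\mathcal{N})^k.$$

Then, $C_a = 24$ and $C_b = 25$.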



Theorem 8 (Theorem 1 in Abadi [2]) Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing
process. Then, there exist constants $C_h > 0$ and $0 < \Xi_1 < 1 < \Xi_2 < \infty$, such
that for all $n \in \mathbb{N}$ and any $A \in \mathcal{C}_n$, there exists $\xi_A \in [\Xi_1, \Xi_2]$, for which the
following inequality holds for all $t > 0$:

$$\left| \mathbb{P}\left( \tau_A > \frac{t}{\xi_A} \right) - e^{-t\mathbb{P}(A)} \right| \le C_h \, \epsilon(A) \, f(A, t),$$

with $\epsilon(A) = \inf_{n \le \ell \le \frac{1}{\mathbb{P}(A)}} [\ell\,\mathbb{P}(A) + \psi(\ell)]$ and $f(A, t) = (t\mathbb{P}(A) \vee 1)\,e^{-t\mathbb{P}(A)}$.

We prove an upper bound for the distance between the rescaled hitting time
and the exponential law of expectation equal to one. The factor $\epsilon(A)$ in the
upper bound shows that the rate of convergence to the exponential law is
given by a trade-off between the length of this time and the speed at which the
process loses memory.

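Proposition 9 We have $C_h = 105$.

PROOF. We fix $c = \frac{1}{2\mathbb{P}(A)}$ and $\Delta$ given by Proposition 6. We define

$$\xi_A = \frac{-\log \mathbb{P}(\tau_A > c - 2\Delta)}{c\,\mathbb{P}(A)}.$$

There are three steps in the proof of the theorem. First, we consider $t$ of the
form $t = kc$ with $k$ a positive integer. Secondly, we prove the theorem for any
$t$ of the form $t = (k + p/q)c$ with $k, p$ positive integers and $1 \le p \le q$ with
$q = \frac{1}{2\Delta\mathbb{P}(A)}$. We also put $\bar{r} = (p/q)c$. Finally, we consider the remaining cases.
Here, for the sake of simplicity, we do not detail the first two steps (for that,
see Abadi [2]), but only the last one. Let $t$ be any positive real number. We
write $t = kc + r$, with $k$ a positive integer and $r$ such that $0 \le r < c$. We can
choose a $\bar{t}$ such that $\bar{t} < t$ and $\bar{t} = (k + p/q)c$ with $p, q$ as before. Abadi [2]
obtains the following bound:

$$\left| \mathbb{P}(\tau_A > t) - e^{-\xi_A \mathbb{P}(A) t} \right| \le \left| \mathbb{P}(\tau_A > t) - \mathbb{P}(\tau_A > \bar{t}) \right| + \left| \mathbb{P}(\tau_A > \bar{t}) - e^{-\xi_A \mathbb{P}(A) \bar{t}} \right| + \left| e^{-\xi_A \mathbb{P}(A) \bar{t}} - e^{-\xi_A \mathbb{P}(A) t} \right|.$$

The first term in the triangular inequality is bounded in the following way:

$$\left| \mathbb{P}(\tau_A > t) - \mathbb{P}(\tau_A > \bar{t}) \right| = \mathbb{P}\left(\tau_A > \bar{t}; \tau_A \circ T^{\bar{t}} \le t - \bar{t}\right) \le \mathbb{P}\left(\tau_A > kc; \tau_A \circ T^{\bar{t}} \le \Delta\right) \le \mathbb{P}(\mathcal{N})^{k-1} (\Delta\mathbb{P}(A) + \psi(\Delta)).$$

The second term is bounded as in the first two steps of the proof in Abadi
[2]. We apply inequalities (1) and (2) to obtain

$$\left| \mathbb{P}(\tau_A > \bar{t}) - e^{-\xi_A \mathbb{P}(A) \bar{t}} \right| \le (3 + C_a t\mathbb{P}(A) + C_a + 2C_b)\,\epsilon(A)\,e^{-\xi_A \mathbb{P}(A) \bar{t}}.$$

Finally, the third term is bounded using the Mean Value Theorem (see for
example Douglass [13]):

$$\left| e^{-\xi_A \mathbb{P}(A) \bar{t}} - e^{-\xi_A \mathbb{P}(A) t} \right| \le \xi_A \mathbb{P}(A) \left( r - \frac{p}{q}c \right) e^{-\xi_A \mathbb{P}(A) \bar{t}} \le \epsilon(A)\,e^{-\xi_A \mathbb{P}(A) \bar{t}}.$$

Thus we have $\left| \mathbb{P}(\tau_A > t) - e^{-\xi_A \mathbb{P}(A) t} \right| \le 105\,\epsilon(A)\,f(A, \xi_A t)$ and the theorem
follows by the change of variables $\tilde{t} = \xi_A t$. Then $C_h = 105$.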

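Lemma 10 Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process. Suppose that $B \subseteq A \in \mathcal{F}_{\{0,\dots,b\}}$,
$C \in \mathcal{F}_{\{b+g,\dots,\infty\}}$ with $b, g \in \mathbb{N}$. The following inequality holds:

$$\mathbb{P}_A(B \cap C) \le \mathbb{P}_A(B)\,\mathbb{P}(C)\,(1 + \psi(g)).$$

PROOF. Since $B \subseteq A$, obviously $\mathbb{P}(A \cap B \cap C) = \mathbb{P}(B \cap C)$. By the $\psi$-mixing
property $\mathbb{P}(B \cap C) \le \mathbb{P}(B)\,\mathbb{P}(C)\,(1 + \psi(g))$. We divide the above inequality by
$\mathbb{P}(A)$ and the lemma follows.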



For all the following propositions and lemmas, we recall that

$$e_\psi(A) = \inf_{1 \le w \le n_A} \left[ (r_A + n)\,\mathbb{P}(A^{(w)})\,(1 + \psi(n_A - w)) \right].$$



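Proposition 11 Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process. Let $A \in \mathcal{R}_p(n)$. Then
the following holds:

(a) For all $M, M' \ge g \ge n$,

$$\left| \mathbb{P}_A(\tau_A > M + M') - \mathbb{P}_A(\tau_A > M)\,\mathbb{P}(\tau_A > M') \right| \le \mathbb{P}_A(\tau_A > M - g)\,2g\,\mathbb{P}(A)\,[1 + \psi(g)],$$

and similarly

$$\left| \mathbb{P}_A(\tau_A > M + M') - \mathbb{P}_A(\tau_A > M)\,\mathbb{P}(\tau_A > M' - g) \right| \le \mathbb{P}_A(\tau_A > M - g)\,[g\,\mathbb{P}(A) + 2\psi(g)].$$

(b) For all $t \ge p \in \mathbb{N}$, with $\zeta_A = \mathbb{P}_A(\tau_A > p_A)$,

$$\left| \mathbb{P}_A(\tau_A > t) - \zeta_A\,\mathbb{P}(\tau_A > t) \right| \le 2 e_\psi(A).$$

The above proposition establishes a relation between hitting and return times
with an error bound uniform with respect to $t$. In particular, (b) says that
these times coincide if and only if $\zeta_A = 1$, namely, the string $A$ is
non-self-overlapping.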

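PROOF. In order to simplify notation, for $t \in \mathbb{Z}$, $\tau_A^{[t]}$ stands for $\tau_A \circ T^t$.
We introduce a gap of length $g$ after coordinate $M$ to construct the following
triangular inequality

$$\left| \mathbb{P}_A(\tau_A > M + M') - \mathbb{P}_A(\tau_A > M)\,\mathbb{P}(\tau_A > M') \right|$$

$$\le \left| \mathbb{P}_A(\tau_A > M + M') - \mathbb{P}_A\left(\tau_A > M; \tau_A^{[M+g]} > M' - g\right) \right| \qquad (3)$$

$$+ \left| \mathbb{P}_A\left(\tau_A > M; \tau_A^{[M+g]} > M' - g\right) - \mathbb{P}_A(\tau_A > M)\,\mathbb{P}(\tau_A > M' - g) \right| \qquad (4)$$

$$+ \,\mathbb{P}_A(\tau_A > M) \left| \mathbb{P}(\tau_A > M' - g) - \mathbb{P}(\tau_A > M') \right|. \qquad (5)$$

Term (3) is bounded with Lemma 10 by

$$\mathbb{P}_A\left(\tau_A > M; \tau_A^{[M]} \le g\right) \le \mathbb{P}_A(\tau_A > M - g)\,g\,\mathbb{P}(A)\,[1 + \psi(g)].$$

Term (4) is bounded using the $\psi$-mixing property by $\mathbb{P}_A(\tau_A > M)\,\psi(g)$. The
modulus in (5) is bounded using stationarity by $\mathbb{P}(\tau_A \le g) \le g\,\mathbb{P}(A)$. This
ends the proof of both inequalities of item (a).

Item (b) for $t \ge 2n$ is proven similarly to item (a) with $t = M + M'$, $M = p$,
and $g = w$ with $1 \le w \le n_A$. Consider now $p \le t < 2n$.

$$\zeta_A - \mathbb{P}_A(\tau_A > t) = \mathbb{P}_A(p < \tau_A \le t) = \mathbb{P}_A\left(\tau_A \in \mathcal{R}(A) \cup (n \le \tau_A \le t)\right) \le e_\psi(A).$$

First and second equalities follow by definition of $\tau_A$ and $\mathcal{R}(A)$. The inequality
follows by Lemma 10.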



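Let $\zeta_A = \mathbb{P}_A(\tau_A > p_A)$ and $h = 1/(2\mathbb{P}(A)) - 2\Delta$, then $\xi_A = -2 \log \mathbb{P}(\tau_A > h)$.

Lemma 12 Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process. Then the following inequality
holds:

$$|\xi_A - \zeta_A| \le 11 e_\psi(A).$$

Hence, we have

$$\zeta_A - 11 e_\psi(A) \le \xi_A \le \zeta_A + 11 e_\psi(A).$$

PROOF.

$$\mathbb{P}(\tau_A > h) = \prod_{i=1}^{h} \mathbb{P}(\tau_A > i \mid \tau_A > i - 1) = \prod_{i=1}^{h} \left(1 - \mathbb{P}(T^{-i}(A) \mid \tau_A > i - 1)\right) = \prod_{i=1}^{h} (1 - \rho_i \mathbb{P}(A)),$$

where $\rho_i = \frac{\mathbb{P}_A(\tau_A > i - 1)}{\mathbb{P}(\tau_A > i - 1)}$. Therefore

$$\left| \xi_A + 2 \sum_{i=1}^{p_A} \log(1 - \rho_i \mathbb{P}(A)) - 2 \sum_{i=p_A+1}^{h} \zeta_A \mathbb{P}(A) \right| \le 2 \sum_{i=p_A+1}^{h} \left| -\log(1 - \rho_i \mathbb{P}(A)) - \zeta_A \mathbb{P}(A) \right|.$$

The above modulus is bounded by

$$\left| -\log(1 - \rho_i \mathbb{P}(A)) - \rho_i \mathbb{P}(A) \right| + |\rho_i - \zeta_A|\,\mathbb{P}(A).$$

Now note that $|y - (1 - e^{-y})| \le (1 - e^{-y})^2$ for $y > 0$ small enough. Apply it with
$y = -\log(1 - \rho_i \mathbb{P}(A))$ to bound the leftmost term of the above expression by
$(\rho_i \mathbb{P}(A))^2$. Further by Proposition 11 (b) and the fact that $\mathbb{P}(\tau_A > h) \ge 1/2$
we have

$$|\rho_i - \zeta_A| \le 4 e_\psi(A)$$

for all $i = p_A + 1, \dots, h$. Yet as before

$$-\sum_{i=1}^{p_A} \log(1 - \rho_i \mathbb{P}(A)) \le p_A \left( \rho_i \mathbb{P}(A) + (\rho_i \mathbb{P}(A))^2 \right) \le e_\psi(A).$$

Finally, by definition of $h$

$$\left| 2 \sum_{i=p_A+1}^{h} \zeta_A \mathbb{P}(A) - \zeta_A \right| \le 4\Delta\mathbb{P}(A) + 2 p_A \mathbb{P}(A) \le 6 e_\psi(A).$$

This ends the proof of the lemma.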

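Proposition 13 Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process. Then the following
inequality holds:

$$\left| \mathbb{P}(\tau_A > t) - e^{-t\mathbb{P}(A)} \right| \le C_p\, e_\psi(A)\,(t\mathbb{P}(A) \vee 1)\, e^{-(\zeta_A - 11 e_\psi(A))\,t\mathbb{P}(A)}.$$

PROOF. We bound the first term with Theorem 8 and the second with
Lemma 12:

$$\left| \mathbb{P}(\tau_A > t) - e^{-t\mathbb{P}(A)} \right| \le \left| \mathbb{P}(\tau_A > t) - e^{-\xi_A t\mathbb{P}(A)} \right| + \left| e^{-\xi_A t\mathbb{P}(A)} - e^{-t\mathbb{P}(A)} \right|,$$

$$\left| \mathbb{P}(\tau_A > t) - e^{-\xi_A t\mathbb{P}(A)} \right| \le C_h\,\epsilon(A)\,e^{-\xi_A t\mathbb{P}(A)} \le C_h\, e_\psi(A)\, e^{-(\zeta_A - 11 e_\psi(A))\,t\mathbb{P}(A)},$$

$$\left| e^{-\xi_A t\mathbb{P}(A)} - e^{-t\mathbb{P}(A)} \right| \le 11\,t\mathbb{P}(A)\,e_\psi(A)\,e^{-(\zeta_A - 11 e_\psi(A))\,t\mathbb{P}(A)}.$$

This ends the proof of the proposition with $C_p = C_h + 11$.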

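Definition 14 Given $A \in \mathcal{C}_n$, we define for $j \in \mathbb{N}$, the $j$-th occurrence time
of $A$ as the random variable $\tau_A^{(j)} : \Omega \to \mathbb{N} \cup \{\infty\}$, defined on the probability
space $(\Omega, \mathcal{F}, \mathbb{P})$ as follows: for any $x \in \Omega$, $\tau_A^{(1)}(x) = \tau_A(x)$ and for $j \ge 2$,

$$\tau_A^{(j)}(x) = \inf\{k > \tau_A^{(j-1)}(x) : T^k(x) \in A\}.$$

Proposition 15 Let $(X_m)_{m \in \mathbb{Z}}$ be a $\psi$-mixing process. Then, for all $A \notin \mathcal{B}_n$,
all $k \in \mathbb{N}$, and all $0 \le t_1 < t_2 < \dots < t_k \le t$ for which $\min_{2 \le j \le k} \{t_j - t_{j-1}\} > 2n$,
there exists a positive constant $C_1$ independent of $A$, $n$, $t$ and $k$ such that

$$\left| \mathbb{P}\left( \bigcap_{j=1}^{k} \left(\tau_A^{(j)} = t_j\right); \tau_A^{(k+1)} > t \right) - \mathbb{P}(A)^k \prod_{j=1}^{k+1} \mathcal{P}_j \right| \le C_1\, k\, (\mathbb{P}(A)(1 + \psi(n)))^k\, e_\psi(A)\, e^{-(t - (3k+1)n)\mathbb{P}(A)},$$

where $\mathcal{P}_j = \mathbb{P}(\tau_A > (t_j - t_{j-1}) - 2n)$.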


PROOF. We will show this proposition by induction on $k$. We put $\Delta_j =
t_j - t_{j-1}$ for $j = 2, \dots, k$, $\Delta_1 = t_1$ and $\Delta_{k+1} = t - t_k$. Firstly, we note that by
stationarity

$$\mathbb{P}(\tau_A = t) = \mathbb{P}(A; \tau_A > t - 1).$$



For A; = 1, by a triangular inequality we obtain 



P rA = ti;rX^>t -P(A)n^: 



< 

+ 
+ 

+ 






F[rA = t^;rX'>t] -P r^ = ti;iV;^ 



P TA = ti;<+2„ = 0)-F{ta = t,)V2 



P(A;r>ti-l)-P(A;iV2V' = 
F{A-Nt,,-' = 0)V2-nA)l[n 



Vo 



ti+2n 



(6) 

(7) 
(8) 

(9) 



Term P is equal to P [ta = ti; UiLt+i T~'{A); Nl^^^n = O) and then 



2n 



(IS])=PU; U T-\A)-Nl = Q\. 

\ ie7?.(A)Ui=l / 

Since A ^ S„, for 1 < z < pA, the above probability is zero. Thus, using 
mixing property 



2n 



<PU; U T-(A);iV*„ = 

\ iG-R.(A)UJ=PA 

2P(A)P(A)(rA + n)(l + V(n))P (iV*n = O) 



< 
Term (7) is bounded using the $\psi$-mixing property.

Analogous computations are used to bound terms (8) and (9).



Now, let us suppose that the proposition holds for $k - 1$ and let us prove it
for $k$. We put $S_i = \{\tau_A^{(i)} = t_i\}$. We use a triangular inequality again to bound
the term in the left hand side of the inequality of the proposition by a sum of
five terms:

k 



p n( 



-(i) 



fc+i 



Vi=i 






<i + ii + ni + iv + v. 



< 



P [n '5,;rr^) > tj - p msr,N!;::^':, = o;T-HAy,Nl^, = o 

P mSj; Nli:^^, = 0; [j T~\A)- T-'-{A)- Nl_,, = 

\j=l i=tk-2n+l 

(P(A)(1 + ^lj{n))f{l - ^{n)) (npA + (ta + n)P(A("'))) e-(*-(3'=+iH^(^), 



// 



n 



^k-i \ 

n 5,; iv^^-^^^i = P (A; iv*-*'^ = o) 

< P I n ; <G+i = J P (A; iV*-*'= = O) tP{7 

< (P(A)(1 + ^(n)))V(n)e-(*-(3^+i)")^(^), 

III = P ( Q '5,; ivt-^-i = j - p ms,- Nli:^^, = 

<p(n5,;ivt-+i = 0; u' r-(^)]p(A) 

\j=l ifc-2n+l / 

< 2P(A)(P(A)(1 + ^(n)))'^e-(*-(3fc+i)n)P{A)_ 

We use the inductive hypothesis for the term IV and the case with k = 1 for 
the term V. 



P (A; N^-^" = 



IV 



p I 'f]sy, <-+! = 1 - p(A)^-^ n n- 

k 



P (A; A^i 



t-tfe 







< Ci{k - 1)(P(A)(1 + V^(n)))'^e^(A)e-(*-(3fc+^)"W^), 
k



< 2(P(A)(1 + ^(n)))'^e^(A)e-(*-(3'=+iWP(^). 
Finally, we obtain

$$I + II + III + IV + V \le (3 + C_1(k - 1) + 2)\,(\mathbb{P}(A)(1 + \psi(n)))^k\, e_\psi(A)\, e^{-(t - (3k+1)n)\mathbb{P}(A)}.$$

To conclude the proof, it is sufficient that $C_1 k = 3 + C_1(k - 1) + 2$, therefore
$C_1 = 5$. This ends the proof of the proposition.
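The value of $C_1$ follows from a one-line computation, spelled out here for convenience; the dependence on $k$ cancels, so the same constant works at every step of the induction:

```latex
\begin{align*}
C_1 k &= 3 + C_1 (k - 1) + 2 \\
C_1 k &= C_1 k - C_1 + 5 \\
C_1 &= 5 .
\end{align*}
```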



5 Proof of Theorem 5

In this section, we prove the main result of our work (see Section 3.2): an
upper bound for the difference between the exact distribution of the number
of occurrences of word $A$ and the Poisson distribution of parameter $t\mathbb{P}(A)$.
Throughout the proof, we will note in italic the terms computed by our
software PANOW (see Section 6.1).

PROOF. For $k = 0$, the result comes from Proposition 13 ($\mathbb{P}(N^t = 0) =
\mathbb{P}(\tau_A > t)$).

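For $k > 2t/n$, since $A \notin \mathcal{B}_n$, we have $\mathbb{P}(N^t = k) = 0$. Hence,

$$\left| \mathbb{P}(N^t = k) - \frac{e^{-t\mathbb{P}(A)} (t\mathbb{P}(A))^k}{k!} \right| = \frac{e^{-t\mathbb{P}(A)} (t\mathbb{P}(A))^k}{k!} \le \frac{(t\mathbb{P}(A))^{k-1}}{(k-1)!} \frac{t\mathbb{P}(A)}{k} \le \frac{1}{2} \frac{(t\mathbb{P}(A))^{k-1}}{(k-1)!}\, e_\psi(A).$$

Indeed, since $k > \frac{2t}{n}$ then $\frac{t\mathbb{P}(A)}{k} < \frac{n\mathbb{P}(A)}{2} \le \frac{e_\psi(A)}{2}$.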

Now, let us consider $1 \le k \le 2t/n$. We consider a sequence which contains
exactly $k$ occurrences of $A$. These occurrences can be isolated or can be in
clumps. We define the following set:

$$T = T(t_1, \dots, t_k) = \left\{ \bigcap_{j=1}^{k} \left(\tau_A^{(j)} = t_j\right); \tau_A^{(k+1)} > t \right\}.$$

We recall that we put $\mathcal{P}_j = \mathbb{P}(\tau_A > (t_j - t_{j-1}) - 2n)$, $\Delta_j = t_j - t_{j-1}$ for
$j = 2, \dots, k$, $\Delta_1 = t_1$ and $\Delta_{k+1} = t - t_k$. Define $I(T) = \min_{2 \le j \le k} \{\Delta_j\}$. We say that
the occurrences of $A$ are isolated if $I(T) \ge 2n$ and we say that there exists at
least one clump if $I(T) < 2n$. We also denote

$$B_k = \{T \mid I(T) < 2n\} \quad \text{and} \quad G_k = \{T \mid I(T) \ge 2n\}.$$

The set $\{N^t = k\}$ is the disjoint union of $B_k$ and $G_k$, then

$$\mathbb{P}(N^t = k) = \mathbb{P}(B_k) + \mathbb{P}(G_k),$$

$$\left| \mathbb{P}(N^t = k) - \frac{e^{-t\mathbb{P}(A)} (t\mathbb{P}(A))^k}{k!} \right| \le \mathbb{P}(B_k) + \left| \mathbb{P}(G_k) - \frac{e^{-t\mathbb{P}(A)} (t\mathbb{P}(A))^k}{k!} \right|.$$

We will prove an upper bound for the two quantities on the right hand side
of the above inequality to conclude the proof of the theorem.

We prove an upper bound for $\mathbb{P}(B_k)$. Define $C(T) = \sum_{j=2}^{k} 1\{\Delta_j > 2n\} + 1$.
$C(T)$ computes how many clusters there are in a given $T$. Suppose that $T$
is such that $C(T) = 1$ and fix the position $t_1$ of the first occurrence of $A$.
Further, each occurrence inside the cluster (with the exception of the leftmost
one, which is fixed at $t_1$) can appear at distance $d$ of the previous one, with
$p_A \le d \le 2n$. Therefore, the $\psi$-mixing property leads to the bound



PI u r(ti,t2,...,tfcj 



< 



k 

n 



U T-'^{A) 

n/2<ti-f-i—ti<2n; 
i=2,...,k 

<P(A)ev,(A)^-ie^(A)e-(*-(3'=+i)")^(^\ 



v 



(10) 



/ 



Suppose now that $T$ is such that $C(T) = i$. Assume also that the leftmost
occurrences of the $i$ clusters of $T$ occur at $t(1), \dots, t(i)$, with $1 \le t(1) < \dots <
t(i) \le t$ fixed. By the same argument used above, we have the inequality

$$\mathbb{P}\left( \bigcup_{\{t_1, \dots, t_k\} \setminus \{t(1), \dots, t(i)\}} T(t_1, \dots, t_k) \right) \le (\mathbb{P}(A)(1 + \psi(n)))^i\, e_\psi(A)^{k-i}\, e^{-(t - (3k+1)n)\mathbb{P}(A)}.$$



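To obtain an upper bound for $\mathbb{P}(B_k)$ we must sum the above bound over all
$T$ such that $C(T) = i$ with $i$ running from 1 to $k - 1$. For fixed $C(T) = i$, the
locations of the leftmost occurrences of $A$ of each one of the $i$ clusters can be
chosen in at most $C_t^i$ many ways. The cardinality of each one of the $i$ clusters
can be arranged in $C_{k-1}^{i-1}$ many ways. (This corresponds to breaking the interval
$(1/2, k + 1/2)$ in $i$ intervals at points chosen from $\{1 + 1/2, \dots, k - 1/2\}$.)
Collecting this information, we have that $\mathbb{P}(B_k)$ is bounded by

$$\sum_{i=1}^{k-1} C_t^i\, C_{k-1}^{i-1}\, (\mathbb{P}(A)(1 + \psi(n)))^i\, e_\psi(A)^{k-i}\, e^{-(t - (3k+1)n)\mathbb{P}(A)}$$

$$\le e^{-(t - (3k+1)n)\mathbb{P}(A)}\, e_\psi(A) \max_{1 \le i \le k-1} \left( \frac{\lambda^i}{i!}\, e_\psi(A)^{k-1-i} \right) \sum_{i=1}^{k-1} C_{k-1}^{i-1}$$

$$\le e^{-(t - (3k+1)n)\mathbb{P}(A)}\, e_\psi(A) \times \begin{cases} \dfrac{(2\lambda)^{k-1}}{(k-1)!} & \text{if } k < \dfrac{\lambda}{e_\psi(A)}, \\[2ex] \dfrac{(2\lambda)^{k-1}}{\left(\dfrac{\lambda}{e_\psi(A)}\right)! \left(\dfrac{1}{e_\psi(A)}\right)^{k - 1 - \frac{\lambda}{e_\psi(A)}}} & \text{if } k \ge \dfrac{\lambda}{e_\psi(A)}. \end{cases}$$

This ends the proof of the bound for $\mathbb{P}(B_k)$.

We compute $\mathbb{P}(B_k) \le \sum_{i=1}^{k-1} C_t^i\, C_{k-1}^{i-1}\, (\mathbb{P}(A)(1 + \psi(n)))^i\, e_\psi(A)^{k-i}\, e^{-(t - (3k+1)n)\mathbb{P}(A)}$.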
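The combinatorial factor used above can be checked by brute force. The sketch below (an illustration with a hypothetical helper name) enumerates the ways of cutting $k$ ordered occurrences into $i$ non-empty consecutive clusters, by choosing $i - 1$ cut points among the $k - 1$ gaps, and confirms that the count is the binomial coefficient $C_{k-1}^{i-1}$ and that these counts sum to $2^{k-1}$ over all $i$.

```python
from itertools import combinations
from math import comb

def cluster_sizes(k, i):
    """All ways to cut k ordered occurrences into i non-empty consecutive clusters."""
    results = []
    for cuts in combinations(range(1, k), i - 1):   # i - 1 cut points among k - 1 gaps
        bounds = (0, *cuts, k)
        results.append([b - a for a, b in zip(bounds, bounds[1:])])
    return results

k = 6
counts = {i: len(cluster_sizes(k, i)) for i in range(1, k + 1)}
total = sum(counts.values())
```

This is where the factor $2^{k-1}$ in $(2\lambda)^{k-1}$ comes from: the sum of the binomial coefficients over the number of clusters is at most $2^{k-1}$.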



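We prove an upper bound for $\left| \mathbb{P}(G_k) - \frac{e^{-t\mathbb{P}(A)} (t\mathbb{P}(A))^k}{k!} \right|$. It is bounded
by four terms by the triangular inequality:

$$\sum_{T \in G_k} \left| \mathbb{P}\left( \bigcap_{j=1}^{k} \left(\tau_A^{(j)} = t_j\right); \tau_A^{(k+1)} > t \right) - \mathbb{P}(A)^k \prod_{j=1}^{k+1} \mathcal{P}_j \right| \qquad (11)$$

$$+ \sum_{T \in G_k} \mathbb{P}(A)^k \left| \prod_{j=1}^{k+1} \mathcal{P}_j - \prod_{j=1}^{k+1} e^{-(\Delta_j - 2n)\mathbb{P}(A)} \right| \qquad (12)$$

$$+ \sum_{T \in G_k} \mathbb{P}(A)^k \left| e^{-(t - 2(k+1)n)\mathbb{P}(A)} - e^{-t\mathbb{P}(A)} \right| \qquad (13)$$

$$+ \left| \frac{\# G_k\, k!}{t^k} - 1 \right| \frac{e^{-t\mathbb{P}(A)} (t\mathbb{P}(A))^k}{k!}. \qquad (14)$$

We will bound these terms to obtain Theorem 5.
First, we bound the cardinal of $G_k$:

$$\# G_k \le C_t^k \le \frac{t^k}{k!}.$$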
Term (11) is bounded with Proposition 15.



t'^ 






Term (12) is bounded with Proposition 13



-i-k fc+lj^l fc+1 

fc! 

where $C_p$ is defined in Proposition 13.
We compute



mA)ik+i 

^^-(A;-l)! k 

[(8 + Cat¥{A) + Ca + 2Cb)e{A) + llt¥{A)e4A)] e-(C4-iie^(A))iP(A)_ 

Term (13) is bounded by

To bound term (14), we bound the following difference:

\[
\left| \frac{\# G_k \, k!}{t^k} - 1 \right| \le \frac{k(k+4n)}{t} .
\]

Then, we have

\[
(14) \le \frac{k(k+4n)}{t} \, e^{-t\mathbb{P}(A)} \frac{(t\mathbb{P}(A))^k}{k!} .
\]



Now, we just have to add the five bounds to obtain the theorem with the
constant $C_\psi = 1 + C_1 + 2C_p + 8 + 8$. Proposition 15 shows that $C_1 = 5$ and
Proposition 13 with Theorem 8 that $C_p = 116$. Then, we prove the theorem
with $C_\psi = 1 + 5 + 2 \cdot 116 + 8 + 8 = 254$.



6 Biological applications



With the explicit value of the constant $C_\psi$ of Theorem 5, and more particularly
thanks to all the intermediary bounds given in the proof of this theorem, we
can develop an algorithm to apply this formula to the study of rare words in
biological sequences. In order to compare different methods, we also compute
the bounds corresponding to a $\phi$-mixing process, for which a proof of Poisson
approximation is given in Abadi and Vergne [5]. Let us recall the definition of
such a mixing process.

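Definition 16 Let $\phi = (\phi(\ell))_{\ell \ge 0}$ be a sequence decreasing to zero.
We say that $(X_m)_{m \in \mathbb{Z}}$ is a $\phi$-mixing process if for all integers
$\ell \ge 0$, the following holds

\[
\sup \frac{\bigl| \mathbb{P}\bigl(B \cap T^{-(n+\ell+1)}(C)\bigr) - \mathbb{P}(B)\mathbb{P}(C) \bigr|}{\mathbb{P}(B)} = \phi(\ell),
\]

where the supremum is taken over the sets $B$ and $C$, such that $\mathbb{P}(B) > 0$.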

Note that obviously, $\psi$-mixing implies $\phi$-mixing. Then, we obtain two new
methods for the detection of over- or under-represented words in biological
sequences and we compare them to the Chen-Stein method.

We recall that Markov models are $\psi$-mixing processes and then also $\phi$-mixing
processes. Then, we first need to know the functions $\psi$ and $\phi$ for a Markov
model. It turns out that we can use

\[
\psi(\ell) = \phi(\ell) = K \nu^{\ell} \quad \text{with } K > 0 \text{ and } 0 < \nu < 1,
\]

where $K$ and $\nu$ have to be estimated (see Meyn and Tweedie [19]). There are
several estimations of $K$ and $\nu$. We choose $\nu$ equal to the second eigenvalue
of the transition matrix of the model and
$K = \bigl( \inf_{j \in \{1,\ldots,|\mathcal{A}|^k\}} \mu_j \bigr)^{-1}$, where $|\mathcal{A}|$
is the alphabet size, $k$ the order of the Markov model and $\mu$ the stationary
distribution of the Markov model.
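As a rough illustration of this choice, the sketch below evaluates a geometric
mixing rate of the form $K\nu^{\ell}$ for a chain given by its transition matrix. It
is a minimal sketch, not the PANOW implementation: the function names, the use of
power iteration for the stationary distribution, the reading of $K$ as the inverse
of the smallest stationary probability, and the two-letter toy chain are all
assumptions made for the example. The value of $\nu$ is passed in as an input (for
a two-state chain the second eigenvalue is one minus the sum of the two
off-diagonal transition probabilities).

```python
def stationary_distribution(P, iterations=500):
    """Approximate the stationary distribution mu (mu = mu P) by power iteration."""
    size = len(P)
    mu = [1.0 / size] * size
    for _ in range(iterations):
        mu = [sum(mu[i] * P[i][j] for i in range(size)) for j in range(size)]
    return mu


def mixing_rate(P, nu, ell):
    """Evaluate K * nu**ell, with K assumed to be 1 / min_j mu_j."""
    mu = stationary_distribution(P)
    K = 1.0 / min(mu)
    return K * nu ** ell


# Hypothetical two-letter chain; its second eigenvalue is 1 - 0.1 - 0.2 = 0.7.
P = [[0.9, 0.1],
     [0.2, 0.8]]
nu = 0.7
rates = [mixing_rate(P, nu, ell) for ell in range(3)]
```

For a real DNA model the matrix would be 4 x 4 (or larger for higher orders), and
$\nu$ would have to come from an eigenvalue computation that this sketch does not
perform.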

We recall that we aim at guessing a relevant biological role of a word in a
sequence using its number of occurrences. Thus we compare the number of
occurrences expected in the Markov chain that models the sequence and the
observed number of occurrences. It is recommended to choose a degree of
significance $s$ to quantify this relevance. We fix arbitrarily a degree of
significance and we want to calculate the smallest number of occurrences $u$
necessary for $\mathbb{P}(N \ge u) \le s$, where $N$ is the number of occurrences of the
studied word. If the number of occurrences counted in the sequence is at least
this $u$, we can consider the word to be relevant with a degree of significance
$s$. We have

\[
\mathbb{P}(N \ge u) \le \sum_{k=u}^{+\infty} \bigl( \mathbb{P}_{\mathcal{P}}(N = k) + \mathrm{Error}(k) \bigr)
\]

where $\mathbb{P}_{\mathcal{P}}(N = k)$ is the probability under the Poisson model that $N$ is
equal to $k$ and $\mathrm{Error}(k)$ is the error between the exact distribution and its
Poisson approximation, bounded using Theorem 5. Then, we search the smallest
threshold $u$ such that

\[
\sum_{k=u}^{+\infty} \bigl( \mathbb{P}_{\mathcal{P}}(N = k) + \mathrm{Error}(k) \bigr) \le s. \tag{15}
\]

Then, we have $\mathbb{P}(N \ge u) \le s$ and we consider the word relevant with a degree
of significance $s$ if it appears at least $u$ times in the sequence.
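The search for the threshold described above can be sketched as follows. This is
a minimal illustration under stated assumptions, not the PANOW code: the function
names are invented for the example, `error` is a hypothetical callable standing
in for the bound of Theorem 5, and the infinite sum is truncated at an arbitrary
`k_max`, beyond which all terms are ignored.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability P(N = k), computed in log space to avoid overflow."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def smallest_threshold(lam, s, error, k_max=5000):
    """Smallest u with sum_{k=u}^{k_max} (pmf(k) + error(k)) <= s, or None."""
    tail = 0.0
    threshold = None
    # Every term is non-negative, so the tail sum grows as u decreases:
    # walk downwards and stop as soon as the budget s is exceeded.
    for u in range(k_max, -1, -1):
        tail += poisson_pmf(u, lam) + error(u)
        if tail > s:
            break
        threshold = u
    return threshold


# Hypothetical inputs: Poisson parameter 1, significance 0.1,
# first without any error term, then with a made-up geometric error.
u_no_error = smallest_threshold(1.0, 0.1, lambda k: 0.0)
u_with_error = smallest_threshold(1.0, 0.1, lambda k: 0.1 * 0.5 ** k)
```

A larger error term can only push the threshold up, which is why a method with a
smaller error in the tail yields a smaller $u$. The function returns `None` when
even the last term exceeds $s$.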

In order to compare the different methods, we compare the thresholds that
they give. Obviously, the smaller the degree of significance, the more relevant
the studied word is. But for a fixed degree of significance, the best method is
the one which gives the smallest threshold $u$. Indeed, giving the smallest $u$ is
equivalent to giving the smallest error in the tail of the distribution between
the exact distribution of the number of occurrences of word $A$ and the Poisson
distribution with parameter $t\mathbb{P}(A)$.
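For an order one Markov model, the Poisson parameter can be obtained from the
stationary distribution and the transition probabilities, since the probability
of a word is the stationary probability of its first letter times the product of
the successive transition probabilities. The sketch below illustrates this; the
function name, the dictionary layout and the uniform toy model are assumptions
made for the example and do not correspond to any model used in the paper.

```python
def word_probability(word, mu, pi):
    """P(A) = mu(a_1) * prod_i pi(a_i, a_{i+1}) under an order one Markov chain."""
    prob = mu[word[0]]
    for a, b in zip(word, word[1:]):
        prob *= pi[a][b]
    return prob


# Hypothetical uniform model on the DNA alphabet and a sequence of length 10**6.
letters = "acgt"
mu = {a: 0.25 for a in letters}
pi = {a: {b: 0.25 for b in letters} for a in letters}
t = 10 ** 6
poisson_parameter = t * word_probability("cccg", mu, pi)
```

Short words give large parameters: in this toy model a four-letter word already
has a parameter of several thousand for a sequence of a million letters.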



6.1 Software availability 



We developed PANOW, dedicated to the determination of threshold $u$ for given
words. This software is written in ANSI C++ and developed on x86 GNU/Linux
systems with GCC 3.4, and successfully tested with the latest GCC versions on
Sun and Apple Mac OSX systems. It relies on the seq++ library (Miele et al. [20]).

Compilation and installation are compliant with the GNU standard procedure.
It is available at http://stat.genopole.cnrs.fr/software/panowdir/. Online
documentation is also available. PANOW is licensed under the GNU General
Public License (http://www.gnu.org/licenses/licenses.html).



6.2 Comparisons between the three different methods



6.2.1 Comparisons using synthetic data. 

We can compare the mixing methods and the Chen-Stein method through the
values of threshold $u$ obtained with PANOW using Abadi and Vergne [5] in the
first case and Reinert and Schbath [26] in the second one. We recall that the
method which gives the smallest threshold $u$ is the best method for a fixed
degree of significance. Table 1 offers a good outline of the possibilities and
limits of each method. It displays some results on different words randomly
selected (no biological meaning for any of these words). Table 1 has been
obtained with an order one Markov model using a random transition matrix
and for a degree of significance of 0.1 and 0.01. IMP means that the method
can not return a result. There are several reasons for that and we explain them
in the following paragraph. Analysing many results, we notice some differences
between the methods.

Table 1
Table of thresholds $u$ obtained by the three methods (sequence length $t$ equal to
$10^6$). For each one of the three methods and for each word, we compute the
threshold which permits to consider the word as an over-represented word or not,
for degree of significance $s$ equal to 0.1 or 0.01. IMP means that the method can
not return a result.

t = $10^6$

             s = 0.1                         s = 0.01
Words        CS     $\phi$   $\psi$          CS     $\phi$   $\psi$
cccg         IMP    IMP      IMP             IMP    IMP      IMP
aagcgc       IMP    1301     378             IMP    1304     392
cgagcttc     18     38       18              IMP    40       22
ttgggctg     14     27       14              18     29       17
gtgcggag     16     32       16              22     34       20
agcaaata     19     39       19              IMP    41

Firstly, none of the methods gives us a result in all the cases. We recall that
the Chen-Stein method gives a bound ($CS$) using the total variation distance.
If the degree of significance $s$ that we choose is smaller than the bound of
Chen-Stein, we never find a threshold $u$ such that

\[
CS + \sum_{k=u}^{+\infty} \mathbb{P}_{\mathcal{P}}(N = k) \le s.
\]

Then, each time that the given bound is higher than the significance degree,
use of the Chen-Stein method is impossible. Therefore there are many examples
that we can not study with this method. Obviously, it is interesting to have a
small degree of significance $s$ and that may be impossible by this restriction
of the Chen-Stein method. For example, this problem appears for the words
aagcgc and cgagcttc in Table 1. For this second word, the Chen-Stein bound
is equal to 0.0107954. Hence, we can use this method for a significance degree
$s$ equal to 0.1 but not for a significance degree of 0.01. The same phenomenon
appears for the word agcaaata (the Chen-Stein bound is equal to 0.0120193).
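The restriction described above can be made concrete with a short sketch. It is
only an illustration under stated assumptions: the function names are invented,
the Poisson parameter used in the example is hypothetical (the paper does not
report it for this word), and the infinite tail is truncated at `k_max`. Only the
Chen-Stein bound 0.0107954 is taken from the text.

```python
import math


def poisson_tail(u, lam, k_max=5000):
    """P(N >= u) for a Poisson variable, summed in log space up to k_max."""
    return sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
               for k in range(u, k_max + 1))


def chen_stein_threshold(lam, s, cs_bound, u_max=5000):
    """Smallest u with cs_bound + P(N >= u) <= s, or None (IMP) if cs_bound >= s."""
    if cs_bound >= s:
        # The Poisson tail is positive for every u, so the inequality cannot hold.
        return None
    for u in range(u_max + 1):
        if cs_bound + poisson_tail(u, lam) <= s:
            return u
    return None


cs_bound = 0.0107954        # Chen-Stein bound quoted in the text for cgagcttc
lam = 1.0                   # hypothetical Poisson parameter, for illustration only
u_at_10_percent = chen_stein_threshold(lam, 0.1, cs_bound)
u_at_1_percent = chen_stein_threshold(lam, 0.01, cs_bound)
```

With these inputs a threshold exists for $s = 0.1$, while for $s = 0.01$ the function
returns `None`, which corresponds to an IMP entry.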

The $\phi$- and $\psi$-mixing methods are not based on the total variation distance.
Then, whatever the degree of significance $s$ and if the studied word satisfies the
three following weak properties, we always give a threshold $u$, contrary to the
Chen-Stein method. In spite of these three conditions, our methods enable us
to study a much broader panel of words than the Chen-Stein method. Indeed,
for these two methods, the only problematic cases arise either when function
$e_\psi$ (see Theorem 5) is larger than 1 or for a "high" parameter of the Poisson
distribution ("high" means larger than 500) or when the word periodicity is
smaller than half its length (see assumptions in Theorem 5, $A \notin \mathcal{B}_n$). In
fact, the first case does not occur very frequently (and never in Table 1). The
reason why the function $e_\psi$ (or a similar function in the $\phi$-mixing case) has
to be smaller than 1 is that, for numerical reasons, the error term has to be
decreasing with the number of occurrences $k$ and without this condition on
$e_\psi$ we can not ensure this decrease. We have to compute error terms for a
finite number of values of k but in order to reduce the computation time, 
when error term becomes smaller than a certain value (we choose 10"'^'^°), we 
suppose all the following error terms equal to this value. That is why the error
term has to be decreasing. The second problem, a "high" parameter of the
Poisson distribution, is just a computational difficulty and once again it does
not occur very frequently (only for the word cccg in Table 1 for instance).
We would like to insist on the main advantage of our methods: we can fix any
significance degree $s$ and, except in the very rare cases mentioned above, we
will find a threshold $u$, contrary to the Chen-Stein method.
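The three conditions listed above can be checked before any threshold is
computed. The sketch below is an assumption-laden illustration rather than the
PANOW code: the function names are invented, the periodicity is taken to be the
smallest shift at which the word overlaps itself (the word length if there is no
overlap), and the value of $e_\psi$ is supplied by the caller as a hypothetical
input.

```python
def periodicity(word):
    """Smallest shift p >= 1 at which the word overlaps itself (len(word) if none)."""
    n = len(word)
    for p in range(1, n):
        if word[p:] == word[:n - p]:
            return p
    return n


def mixing_method_applicable(word, poisson_parameter, e_psi):
    """Check the three conditions of the text (limits 1 and 500 are quoted there)."""
    return (e_psi < 1.0
            and poisson_parameter <= 500.0
            and 2 * periodicity(word) >= len(word))


# Words taken from the tables; the numerical inputs are hypothetical.
period_ggtggtgg = periodicity("ggtggtgg")
period_gctggtgg = periodicity("gctggtgg")
ok_gctggtgg = mixing_method_applicable("gctggtgg", 10.0, 0.5)
ok_ggtggtgg = mixing_method_applicable("ggtggtgg", 10.0, 0.5)
```

Under this definition ggtggtgg overlaps itself after a shift of three letters,
which is less than half its length, so it is rejected whatever the other inputs.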

Also, we can use our methods for any Markov chain order. Indeed, PANOW runs
fast enough contrary to the R program used to compute the Chen-Stein bound
of Reinert and Schbath [26]. Note that, in program PANOW, we give another
method to compute the Chen-Stein bound (see Abadi [3]) and this method
gives approximately the same Chen-Stein bound.



The second main observation we can make is that, when it works, the Chen-Stein
method gives either a threshold $u$ similar to that of the $\psi$-mixing method,
or a larger one. This means that the $\psi$-mixing method out-performs the
Chen-Stein method.






Thirdly we notice that the $\psi$-mixing method is always better than the
$\phi$-mixing one. Obviously, this result was expected by the definitions of these
mixing processes and also by the theorems because of the extra factor
$e^{-(t-(3k+1)n)\mathbb{P}(A)}$ (see Theorem 5 and Theorem 2 in Abadi and Vergne [5]).
We are interested by the real impact of this factor on the threshold $u$: it is
significantly better in the case of a $\psi$-mixing process.



6.2.2 Biological comparisons. 

Now, we present a few results obtained on real biological examples with order
one Markov models. There are many categories of words which have relevant
biological functions (promoters, terminators, repeat sequences, chi sites,
uptake sequences, bend sites, signal peptides, binding sites, restriction sites, ...).
Some of them are highly present in the sequence, some others are almost
absent. Then, it turns out to be interesting to consider the over- or the
under-representation of words to find biologically relevant words.

In this section, we test our methods on words already known to be relevant.
We focus our study on Chi sites and uptake sequences. Chi sites of bacteria
protect the genome by stopping its degradation performed by a particular
enzyme. The function of this enzyme is to destroy viruses which could appear
in the bacterium. Viruses do not contain Chi sites and then are exterminated.
It turns out that Chi sites are highly present in the bacterial genome. Uptake
sequences are abundant sequence motifs, often located downstream of ORFs,
that are used to facilitate the within-species horizontal transfer of DNA.

Example 1 

First, we consider the Chi of Escherichia coli, gctggtgg (see Table 2), for
different degrees of significance. We use the complete sequence of Escherichia
coli K12 (Blattner et al. [11]). Sequence length is equal to 4639221. We recall
that for a fixed significance degree, the smaller the threshold $u$, the better the
method is. Then, we can conclude that the $\psi$-mixing method gives the most
interesting results. Chi of E. coli could be considered as an over-represented
one from 99 occurrences for a significance degree $s$ of 0.0001. Because the
Chen-Stein bound is equal to 0.067726, the Chen-Stein method does not permit to
conclude for significance degrees of 0.01 and 0.001. Moreover, it is well known
that Chi of E. coli is a very relevant word in this bacterium. Then, we expect
a very small significance degree for this word. Unfortunately, the minimal
significance degree which could be obtained by the Chen-Stein method is, in fact,
the Chen-Stein bound: 0.067726. Our method allows to obtain a very small
significance degree, and the minimal significance degree for which Chi of E. coli
is considered as an over-represented word by the $\psi$-mixing method is given in
the last line of Table 2: it is equal to $10^{-239}$. Note also that the thresholds $u$
increase as the significance degree $s$ decreases. To understand this fact, it is
sufficient to look at inequality (15). But they increase slowly while the
significance degree $s$ decreases. It could be surprising but it is due to the
error term which decreases very fast from a certain number of occurrences.

Table 2
Table of thresholds $u$ obtained by the three methods for the Chi of Escherichia
coli: gctggtgg (sequence length $t$ equal to 4639221). For each one of the three
methods we compute the threshold which permits to consider the word as an
over-represented word or not, for degree of significance $s$. IMP means that the
method can not return a result; "counts" corresponds to the number of occurrences
observed in the sequence.

s            Chen-Stein   $\phi$-mixing   $\psi$-mixing   counts
0.1          87           193             83              499
0.01         IMP          195             92              499
0.0001       IMP          197             99              499
$10^{-239}$  IMP          549             498             499

Example 2 

Second, we consider the Chi of Haemophilus influenzae and its uptake sequence
(see Table 3), for a significance degree $s$ equal to 0.01. We use the complete
sequence of Haemophilus influenzae (Fleischmann et al. [15]). Sequence length
is equal to 1830138. We observe that in all the cases the $\psi$-mixing method is
the best one because it gives the smallest $u$, except for the word ggtggtgg
which has a periodicity less than half its length (and then we can not study it:
see assumptions in Theorem 5).

Table 3
Table of thresholds $u$ obtained by the three methods for the Chi and the uptake
sequence of Haemophilus influenzae (sequence length $t$ equal to 1830138). For each
one of the three methods and for each word, we compute the threshold which permits
to consider the word as an over-represented word or not, for degree of significance
equal to 0.01. IMP means that the method can not return a result; "counts"
corresponds to the number of occurrences observed in the sequence.

Words                Chen-Stein   $\phi$-mixing   $\psi$-mixing   counts
gatggtgg (chi)       23           36              22              20
gctggtgg (chi)       21           32              20              44
ggtggtgg (chi)       16           IMP             IMP             57
gttggtgg (chi)       30           45              26              37
aagtgcggt (uptake)   13           17              13              737

We can not assume the good significance of the
first Chi (gatggtgg) because we count only 20 occurrences in the sequence,
whereas 23 occurrences are necessary to consider this word as exceptional. On
the other hand, the uptake sequence is very significant (and then very rele- 
vant). Indeed, we could fix a significance degree equal to 10"^^"^ and consider it 
as an over-represented word from 736 occurrences with the $\psi$-mixing method.
As aagtgcggt is counted 737 times in the sequence, we obtain the well-known 
fact that this word is biologically relevant. 



7 Conclusions and perspectives 



To conclude this paper, we recall the advantages of our new methods. We give
an error valid for all the values $k$ of the random variable $N^t$ corresponding
to the number of occurrences of word $A$ in a sequence of length $t$. Then, we
can find a minimal number of occurrences to consider a word as biologically
relevant for a very large number of words and for all degrees of significance.
That is the main advantage of our methods over the Chen-Stein one, which is
based on the total variation distance and for which small degrees of significance
can not be obtained. Results of our $\psi$-mixing method and the Chen-Stein
method remain similar but our method has fewer limitations. Note that our
methods provide good results for general modelling processes such as
Markov chains as well as every $\phi$- and $\psi$-mixing process.

In terms of perspectives, as we expect more significant results, we hope to
improve these methods by adapting them directly to Markov chains instead of
$\psi$- or $\phi$-mixing. Moreover, it is well-known that a compound Poisson
approximation is better for self-overlapping words (see Reinert et al. [27] and
Reinert and Schbath [26]). An error term for the compound Poisson approximation
for self-overlapping words can be easily derived from our results.